
these three predictions, for example, if beta strand and helix but no loop region are predicted simultaneously by the three lower-level networks.

Further tricks additionally improve the predictions of this software. In particular, many sequences with similar structure are automatically added to the query sequence as a multiple alignment. In this way, the secondary structure prediction reaches an accuracy of up to 80%, which is already very close to the theoretical optimum. The only way to become even more accurate is to predict the three-dimensional structure at the same time.
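As a purely illustrative sketch in Python (not the actual architecture of the software discussed above), the conflicting per-residue outputs of three class-specific networks could be combined into a single consensus prediction, for example by picking the best-supported class at each position:

import numpy as np

# Illustrative scores only: three "lower-level networks", one per class
# (helix H, beta strand E, loop L), each reporting a score per residue.
helix_score  = np.array([0.9, 0.8, 0.2, 0.1, 0.7])
strand_score = np.array([0.8, 0.3, 0.1, 0.2, 0.1])
loop_score   = np.array([0.1, 0.2, 0.9, 0.8, 0.3])

scores  = np.vstack([helix_score, strand_score, loop_score])  # shape (3, n_residues)
classes = np.array(["H", "E", "L"])

# The first residue illustrates the conflict mentioned above: helix and strand
# both score high while loop does not. A simple consensus (here a stand-in for
# the higher-level network) assigns each residue the class with the strongest
# support.
consensus = classes[np.argmax(scores, axis=0)]
print("".join(consensus))  # -> HHLLH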

Question 14.5

One such software is MemBrain (https://www.membrain-nn.de/index.htm; https://www.membrain-nn.de/).

Question 14.6

Please search the Internet for deep learning and read up on the topic. The page https://deeplearning.net/ is also helpful. For AlphaGo, see the Internet as well (https://deepmind.com/research/alphago; https://www.youtube.com/watch?v=mzpW10DPHeQ).

Question 14.7

Classification models are used in bioinformatics to distinguish between two categories (binary classification), for example in the diagnosis of a disease (sick/healthy). It is important to become familiar with the classification table (confusion matrix; TP, FP, FN, TN), but also with the performance metrics used to evaluate a classification model (sensitivity, false positive rate, specificity, PPV, NPV, accuracy, misclassification rate, prevalence, ROC, AUC). It is also important to understand, for example, the differences between sensitivity and PPV, and between specificity and NPV.

For example, imagine a person receives a positive (or negative) result from a predictive test with a sensitivity of 90%, a specificity of 99%, a PPV of 80%, and an NPV of 99%. The positive test result can be trusted only with 80% probability to indicate actual disease (20% of positive results are false positives, so the person is fortunately healthy), whereas a negative test result can be trusted far more to indicate actual health (only 1% of negative results are false negatives, i.e. persons who are actually sick). Most diagnostic testing procedures take this into account and, in the case of a positive test result, carry out a second test to confirm the diagnosis (e.g. mammography screening). On the other hand, a test should in any case be reliable enough that a negative result identifies a truly healthy person with high probability: it would be worse to send home a supposedly healthy person (negative test result) who is in fact sick (false negative) and who therefore receives no therapy or infects other people with a virus (e.g. COVID-19).

In addition, one should think about the problems that arise when building a classification model (too little data, etc.), as well as the requirements such a model should meet.
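As a minimal illustration of these metrics, the following Python sketch computes them from confusion-matrix counts. The counts (10,000 tested persons, 5% prevalence) are hypothetical assumptions chosen so that sensitivity and specificity roughly match the example above; they are not values from the text.

def confusion_metrics(tp, fp, fn, tn):
    """Common performance metrics from the counts of a 2x2 confusion matrix."""
    total = tp + fp + fn + tn
    return {
        "sensitivity (TPR)":      tp / (tp + fn),   # recognized sick persons
        "specificity (TNR)":      tn / (tn + fp),   # recognized healthy persons
        "false positive rate":    fp / (fp + tn),   # = 1 - specificity
        "PPV":                    tp / (tp + fp),   # trust in a positive result
        "NPV":                    tn / (tn + fn),   # trust in a negative result
        "accuracy":               (tp + tn) / total,
        "misclassification rate": (fp + fn) / total,
        "prevalence":             (tp + fn) / total,
    }

# Hypothetical screening of 10,000 persons with 5% prevalence,
# a sensitivity of 90% and a specificity of 99%.
sick, healthy = 500, 9500
tp, fn = 450, 50      # 90% of the sick test positive
tn, fp = 9405, 95     # 99% of the healthy test negative

for name, value in confusion_metrics(tp, fp, fn, tn).items():
    print(f"{name:24s} {value:.3f}")
# PPV comes out at about 0.83 and NPV at about 0.995, i.e. a negative result
# can be trusted more than a positive one, as discussed above.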

To build a predictive model, it is advisable to split the data into a training and a test set (e.g. 80/20%) and to validate the model on at least one independent dataset in order to better assess its predictive power.
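A minimal sketch of this workflow, assuming scikit-learn is available and using synthetic data in place of a real bioinformatics dataset, could look like this:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score

# Synthetic binary data (sick/healthy); a real study would load measured features.
X, y = make_classification(n_samples=1000, n_features=20, weights=[0.8, 0.2],
                           random_state=0)

# 80/20 split; stratify keeps the class proportions (prevalence) equal in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Confusion matrix and AUC on the held-out test set.
tn, fp, fn, tp = confusion_matrix(y_test, model.predict(X_test)).ravel()
auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"TP={tp}  FP={fp}  FN={fn}  TN={tn}  AUC={auc:.2f}")

# The same evaluation should then be repeated on at least one independently
# collected dataset to judge how well the model generalizes.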
